Gradient descent for machine learning - a quick introduction

In this notebook we are going to use gradient descent to estimate the parameters of a model. In this case, we are going to compute the parameters needed to convert temperatures from Fahrenheit to Kelvin.

Given that we are approaching this from a machine learning perspective, we are going to determine our scaling factor and offset value by gradient descent, given some example temperatures on both scales. In other words, we are going to learn the parameters from the data.

First, some imports:


In [1]:
import numpy as np
import pandas as pd

Our data set:


In [2]:
INDEX = ['Boiling point of He',
         'Boiling point of N',
         'Melting point of H2O',
         'Body temperature',
         'Boiling point of H2O']

X = np.array([-452.1, -320.4, 32.0, 98.6, 212.0])
Y = np.array([4.22, 77.36, 273.2, 310.5, 373.2])

Show our data set in a table:


In [3]:
pd.DataFrame(np.stack([X, Y]).T, index=INDEX,
             columns=['Fahrenheit ($x$)', 'Kelvin ($y$)'])


Out[3]:
                      Fahrenheit ($x$)  Kelvin ($y$)
Boiling point of He             -452.1          4.22
Boiling point of N              -320.4         77.36
Melting point of H2O              32.0        273.20
Body temperature                  98.6        310.50
Boiling point of H2O             212.0        373.20

Model - linear regression

Temperatures can be converted using a linear model of the form $y=ax+b$.

$x$ and $y$ are samples from our dataset, $X=\{x_0...x_N\}$ and $Y=\{y_0...y_N\}$, while $a$ and $b$ are the model parameters.

Initialise model

Let's try initialising the parameter $a$ randomly and $b$ to 0, and see what the model predicts:


In [4]:
# Let's initialise `a` to a random value between 1.0 and 2.0; this guarantees it
# starts far from the correct value, forcing our model to do some work.
a = np.random.uniform(1.0, 2.0, size=())
b = 0.0

print('a={}, b={}'.format(a, b))

Y_pred = X * a + b
pd.DataFrame(np.stack([X, Y, Y_pred]).T, index=INDEX,
             columns=['Fahrenheit ($x$)', 'Kelvin ($y$)', '$y_{pred}$'])


a=1.58879502645, b=0.0
Out[4]:
                      Fahrenheit ($x$)  Kelvin ($y$)   $y_{pred}$
Boiling point of He             -452.1          4.22  -718.294231
Boiling point of N              -320.4         77.36  -509.049926
Melting point of H2O              32.0        273.20    50.841441
Body temperature                  98.6        310.50   156.655190
Boiling point of H2O             212.0        373.20   336.824546

How good is our guess?

To estimate the accuracy of our model, let's compute the squared error. We use the squared error since it is always positive, and larger errors incur a disproportionately greater cost due to the squaring:


In [5]:
sqr_err = (Y_pred - Y)**2

pd.DataFrame(np.stack([X, Y, Y_pred, sqr_err]).T, index=INDEX,
             columns=['Fahrenheit ($x$)', 'Kelvin ($y$)', '$y_{pred}$', r'squared err ($\epsilon$)'])


Out[5]:
                      Fahrenheit ($x$)  Kelvin ($y$)   $y_{pred}$  squared err ($\epsilon$)
Boiling point of He             -452.1          4.22  -718.294231             522026.814658
Boiling point of N              -320.4         77.36  -509.049926             343876.601867
Melting point of H2O              32.0        273.20    50.841441              49443.328829
Body temperature                  98.6        310.50   156.655190              23668.225685
Boiling point of H2O             212.0        373.20   336.824546               1323.173682

Reducing the error

We reduce the error by taking the gradient of the squared error with respect to the parameters $a$ and $b$ and iteratively modifying the values of $a$ and $b$ in the direction of the negated gradient.

Let's determine the expressions for the gradient of the squared error $\epsilon$ with respect to $a$ and $b$:

$\epsilon_i = (ax_i + b - y_i)^2 = a^2x_i^2 + 2abx_i - 2ax_iy_i + b^2 + y_i^2 - 2by_i$

In terms of $a$: $\epsilon_i = a^2x_i^2 + a(2bx_i - 2x_iy_i) + b^2 + y_i^2 - 2by_i$

So ${d\epsilon_i\over{da}} = 2ax_i^2 + 2bx_i - 2x_iy_i$

In terms of $b$: $\epsilon_i = b^2 + b(2ax_i - 2y_i) + a^2x_i^2 - 2ax_iy_i + y_i^2$

So ${d\epsilon_i\over{db}} = 2b + 2ax_i - 2y_i$

The above expressions apply to a single sample only. To apply them to all five of our data points, we use the mean squared error $\bar\epsilon$: the sum of the individual errors divided by the number of data points $N$. Since differentiation is linear, the derivative of $\bar\epsilon$ with respect to $a$ or $b$ is likewise the sum of the individual derivatives, divided by $N$.
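
Written out explicitly, the quantities we will use are:

$\bar\epsilon = {1\over N}\sum_i (ax_i + b - y_i)^2$

${d\bar\epsilon\over{da}} = {1\over N}\sum_i (2ax_i^2 + 2bx_i - 2x_iy_i)$

${d\bar\epsilon\over{db}} = {1\over N}\sum_i (2b + 2ax_i - 2y_i)$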

Gradient descent

Gradient descent is performed iteratively; at each step, each parameter is updated independently as follows:

$a' = a - \gamma {d\bar\epsilon\over{da}}$

$b' = b - \gamma {d\bar\epsilon\over{db}}$

where $\gamma$ is the learning rate.

Implementation

We now have all we need to define some gradient descent helper functions:


In [6]:
def iterative_gradient_descent_step(a, b, lr):
    """
    A single gradient descent iteration
    
    :param a: current value of `a`
    :param b: current value of `b`
    :param lr: learning rate
    
    :return: a tuple `(a_next, b_next)` that are the values of `a` and `b` after the iteration.
    """
    # Derivatives of the mean squared error w.r.t. a and b:
    depsilon_da = (2 * a * X**2 + 2 * b * X - 2 * X * Y).mean()
    depsilon_db = (2 * b + 2 * a * X - 2 * Y).mean()

    # Gradient descent: step in the direction of the negated gradient, scaled by the learning rate:
    a = a - depsilon_da * lr
    b = b - depsilon_db * lr
    
    # Return new values
    return a, b


def state_as_table(a, b):
    """
    Helper function to generate a Pandas DataFrame showing the current state, including predicted values and errors
    
    :param a: current value of `a`
    :param b: current value of `b`
    
    :return: tuple `(df, mean_sqr_err)` where `df` is the Pandas DataFrame and `mean_sqr_err` is the mean squared error
    """
    Y_pred = X * a + b
    sqr_err = (Y_pred - Y)**2

    df = pd.DataFrame(np.stack([X, Y, Y_pred, sqr_err]).T, index=INDEX,
                      columns=['Fahrenheit ($x$)', 'Kelvin ($y$)', '$y_{pred}$', r'squared err ($\epsilon$)'])
    return df, sqr_err.mean()

Define the learning rate and number of iterations, and show the initial state:


In [7]:
LEARNING_RATE = 0.00001
N_ITERATIONS = 50000

df, mean_sqr_err = state_as_table(a, b)
print('a = {}, b = {}, mean sqr. err. = {}'.format(a, b, mean_sqr_err))
df


a = 1.58879502645, b = 0.0, mean sqr. err. = 188067.628944
Out[7]:
                      Fahrenheit ($x$)  Kelvin ($y$)   $y_{pred}$  squared err ($\epsilon$)
Boiling point of He             -452.1          4.22  -718.294231             522026.814658
Boiling point of N              -320.4         77.36  -509.049926             343876.601867
Melting point of H2O              32.0        273.20    50.841441              49443.328829
Body temperature                  98.6        310.50   156.655190              23668.225685
Boiling point of H2O             212.0        373.20   336.824546               1323.173682

Run gradient descent

Run this cell repeatedly to see gradient descent in action:


In [31]:
for i in range(N_ITERATIONS):
    a, b = iterative_gradient_descent_step(a, b, LEARNING_RATE)

df, mean_sqr_err = state_as_table(a, b)
print('a = {}, b = {}, mean sqr. err. = {}'.format(a, b, mean_sqr_err))
df


a = 0.555810262011, b = 255.484566228, mean sqr. err. = 0.0131644123508
Out[31]:
                      Fahrenheit ($x$)  Kelvin ($y$)  $y_{pred}$  squared err ($\epsilon$)
Boiling point of He             -452.1          4.22    4.202747                  0.000298
Boiling point of N              -320.4         77.36   77.402958                  0.001845
Melting point of H2O              32.0        273.20  273.270495                  0.004969
Body temperature                  98.6        310.50  310.287458                  0.045174
Boiling point of H2O             212.0        373.20  373.316342                  0.013535

Evaluation

The formula for conversion from Fahrenheit to Kelvin is:

$T_K = {5\over9}T_F + 255.372$

Therefore:

$a = {5\over9} \approx 0.556$

$b = 255.372$

If the above cell has been run enough times, $a$ and $b$ should have converged to values close to the ideal values above (some error is expected, as the input data contains small rounding errors).
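
As a quick sanity check (a hypothetical extra cell, assuming `a` and `b` still hold the trained values from the cell above), we can compare the learned parameters against the ideal constants and convert a test temperature with both:

# Hypothetical check cell: compare the learned parameters to the ideal constants
print('a = {:.6f} (ideal: 5/9 = {:.6f})'.format(a, 5.0 / 9.0))
print('b = {:.6f} (ideal: 255.372)'.format(b))

# Convert an arbitrary test temperature with the learned model and the exact formula
t_f = 300.0
print('{} F -> {:.2f} K (learned), {:.2f} K (formula)'.format(
    t_f, a * t_f + b, 5.0 / 9.0 * t_f + 255.372))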

Conclusions

There are some problems with the training procedure above; namely, a very low learning rate and a huge number of iterations were required. This is because the inputs in X have large magnitudes (and the gradients involve $x_i^2$ terms), so a larger learning rate causes the parameters to take huge steps in one direction or another, often making them oscillate between negative and positive values with rapidly increasing magnitudes, after which the model 'explodes'.

This could be addressed by standardising the data in X and Y, i.e. subtracting the mean and dividing by the standard deviation. This would allow a much higher learning rate and a far smaller number of iterations to suffice.
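
A minimal sketch of this approach (not part of the original notebook; the names `Xs`, `Ys`, `a_s`, `b_s` and the learning rate and iteration count are illustrative choices): standardise the data, run the same update rule on the standardised values, then map the learned parameters back to the original units:

# Sketch: gradient descent on standardised data (hypothetical extra cell)
X_mean, X_std = X.mean(), X.std()
Y_mean, Y_std = Y.mean(), Y.std()
Xs = (X - X_mean) / X_std
Ys = (Y - Y_mean) / Y_std

a_s, b_s = np.random.uniform(1.0, 2.0, size=()), 0.0
for i in range(500):  # far fewer iterations than before
    depsilon_da = (2 * a_s * Xs**2 + 2 * b_s * Xs - 2 * Xs * Ys).mean()
    depsilon_db = (2 * b_s + 2 * a_s * Xs - 2 * Ys).mean()
    a_s = a_s - depsilon_da * 0.1  # learning rate of 0.1 instead of 0.00001
    b_s = b_s - depsilon_db * 0.1

# Undo the standardisation to recover parameters in the original units:
# y = (a_s * x_s + b_s) * Y_std + Y_mean, with x_s = (x - X_mean) / X_std
a_orig = a_s * Y_std / X_std
b_orig = Y_mean + b_s * Y_std - a_s * Y_std * X_mean / X_std
print('a = {}, b = {}'.format(a_orig, b_orig))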

That said, this notebook demonstrates the use of gradient descent to train a simple linear regression model; I hope you have found it helpful.